Krupovic et al. BMC Biology 2014, 12:36 
http://www.biomedcentral.com/1741-7007/12/36



BMC Biology 



RESEARCH ARTICLE Open Access 



Casposons: a new superfamily of self-synthesizing 
DNA transposons at the origin of prokaryotic 
CRISPR-Cas immunity 

Mart Krupovic¹, Kira S Makarova², Patrick Forterre¹, David Prangishvili¹ and Eugene V Koonin²



Abstract 

Background: Diverse transposable elements are abundant in genomes of cellular organisms from all three domains 
of life. Although transposons are often regarded as junk DNA, a growing body of evidence indicates that they are 
behind some of the major evolutionary innovations. With the growth in the number and diversity of sequenced 
genomes, previously unnoticed mobile elements continue to be discovered. 

Results: We describe a new superfamily of archaeal and bacterial mobile elements which we denote casposons 
because they encode Cas1 endonuclease, a key enzyme of the CRISPR-Cas adaptive immunity systems of archaea 
and bacteria. The casposons share several features with self-synthesizing eukaryotic DNA transposons of the 
Polinton/Maverick class, including terminal inverted repeats and genes for B family DNA polymerases. However, 
unlike any other known mobile elements, the casposons are predicted to rely on Cas1 for integration and excision,
via a mechanism similar to the integration of new spacers into CRISPR loci. We identify three distinct families of 
casposons that differ in their gene repertoires and evolutionary provenance of the DNA polymerases. Deep 
branching of the casposon-encoded endonuclease in the Cas1 phylogeny suggests that casposons played a pivotal 
role in the emergence of CRISPR-Cas immunity. 

Conclusions: The casposons are a novel superfamily of mobile elements, the first family of putative self-synthesizing 
transposons discovered in prokaryotes. The likely contribution of casposons to the evolution of CRISPR-Cas parallels the
involvement of the RAG1 transposase in vertebrate immunoglobulin gene rearrangement, suggesting that recruitment 
of endonucleases from mobile elements as ready-made tools for genome manipulation is a general route of evolution 
of adaptive immunity. 

Background

Cellular organisms in the three domains of life are under 
constant onslaught of invading mobile genetic elements 
(MGE), such as transposons, viruses and plasmids. Many, 
if not most, of these diverse selfish elements insert into 
the chromosomes of the cellular hosts, either as an obli- 
gate part of their life cycles or at least occasionally, and 
in multicellular eukaryotes constitute a substantial pro- 
portion of the host genome. For example, sequencing of 
the human genome has shown that transposons or relics 
thereof amount to 35% to 50% of the genome [1,2] and
subsequent analyses have only revised these estimates up- 
ward [3,4]. Even more strikingly, in some green plants, 
MGE-derived DNA seems to represent more than 70% of 
the genome [5,6]. Although not as abundant as in eukary- 
otes, proviruses and other MGE constitute up to 30% of 
some bacterial genomes [7,8]. The effects of MGE inte- 
gration vary from beneficial (gain of new phenotypic traits, 
such as antibiotic resistance or toxin production) to dele- 
terious (disruption or inactivation of essential cellular 
genes upon MGE insertion) [7,9-11]. For most prokaryotic 
plasmids and viruses, the circular form of the MGE gen- 
ome is inserted into specific loci (site-specific integration) 
of the cellular chromosome with the aid of MGE-encoded 
enzymes known as integrases [12]. The integrases are 
grouped into two major families on the basis of sequence
conservation and mechanistic relatedness: (1) tyrosine 
recombinases use a catalytic tyrosine residue which at- 
tacks the DNA and becomes covalently linked to it during 
strand exchange; and (2) serine recombinases for the same 
purpose use a nucleophilic serine residue [12]. 

Transposons are DNA segments that move from one lo- 
cation in the host genome to another. Although several 
classification schemes have been proposed, based on the 
nature of the transposition intermediate, transposons can 
be generally grouped into two classes [10,13] or types [14]. 
Class I (or type 2) elements, also known as retrotransposons,
transpose via an RNA intermediate which prior to
integration is copied back to the DNA form by the 
element-encoded reverse transcriptase. Class II (or type 1) 
DNA transposons move in the genome by the so-called 
'cut-and-paste' mechanism whereby the transposon is 
excised from its initial location and inserted into a new 
genomic locus. Most of the class II transposons possess 
characteristic terminal inverted repeats (TIR) but differ 
widely in terms of the transposases they encode, the spe- 
cific mechanisms of transposition, the element size and 
gene content [10,13,15]. Although most transposases 
belong to the DDE superfamily (named after two aspartate 
and one glutamate residues that form the catalytic triad 
of these enzymes) [10,13,16], some transposons encode 
transposases homologous to the rolling-circle replication 
initiation endonucleases found in numerous viruses and 
plasmids [17-19], to phage integrase-like tyrosine recombi- 
nases [20,21] or to the serine integrases/invertases [22]. 
Furthermore, some bacterial and eukaryotic viruses encode 
transposases that are involved in the integration of the viral 
genome into the host chromosome, thereby partially blurring
the distinction between different MGE types [23-25].

A distinct group of MGE consists of large (15 to 20 kb), 
self-synthesizing DNA transposons, called Mavericks or 
Polintons [26,27]. The defining feature of Polintons/ 
Mavericks is that they encode their own protein-primed 
type B DNA polymerase which is most likely involved in 
the transposon replication (hence 'self-synthesizing' transposons)
[26]. In addition, these transposons encode several
hallmark viral proteins, the genome packaging ATPase 
and protease. Recently, we have shown that Polintons/ 
Mavericks also encode major and minor capsid proteins, 
suggesting that these elements combine features of bona 
fide viruses and transposons [24]. Polintons/Mavericks are 
widespread in diverse unicellular and multicellular eukary- 
otes [26,27]. In contrast, no such self-synthesizing DNA 
transposons have been described in prokaryotes. 

To survive the proliferation of various MGE and to 
maintain genetic integrity, cellular organisms have evolved 
numerous defense lines, including a variety of innate and 
adaptive immunity mechanisms [28-30]. Although once 
considered to be characteristic exclusively of animals, 



adaptive immunity has been recently discovered in bacteria 
and archaea [30-33]. This system consists of arrays of clus- 
tered regularly interspaced short palindromic repeats 
(CRISPR) and CRISPR-associated proteins (Cas) and elicits 
interference against foreign nucleic acids by degrading 
them in a sequence-specific fashion. The specificity is en- 
sured by the unique spacers homologous to viral or plas- 
mid DNA and integrated into the CRISPR loci. The action 
of the CRISPR-Cas system can be divided into three stages. 
The first stage, called adaptation, involves insertion of for- 
eign DNA spacers into the CRISPR repeats. This step is 
mediated by the two most conserved core proteins of the 
CRISPR-Cas system, Cas1 and Cas2 [34-36]. Although the
mechanistic details of adaptation remain poorly understood,
it has been demonstrated that Cas1 is the endonuclease
responsible for the excision of the protospacer
from the foreign DNA and its insertion into the CRISPR 
cassette [36-40]. During the second stage, expression and 
processing, the CRISPR locus containing the arrays of 
spacers is transcribed, producing a long pre-crRNA 
(CRISPR RNA), which is subsequently processed by Cas 
proteins into short guide crRNAs. The final stage is called 
interference and involves degradation of the alien DNA or 
RNA by the Cas enzymatic machinery guided by the bound 
crRNA [30,31,35]. Phylogenomic analyses of the Cas pro- 
teins from diverse archaea and bacteria yielded a wealth of 
information on the diversity and evolution of the CRISPR- 
Cas immunity [33]. However, it remains unclear how this 
sophisticated defense system emerged in the first place. 

Here, we describe the discovery and characterization of a 
new superfamily of MGE that possess several features re- 
sembling the eukaryotic self-synthesizing DNA transposons 
but are integrated in the genomes of various archaea and 
some bacteria. Along with family B DNA polymerases
(PolB) that are related either to viral protein-primed PolBs
or to typical archaeal PolBs, these elements, which we
denote 'casposons', encode Cas1 proteins of a distinct subfamily.
We propose that, unlike other known MGE,
casposons utilize Cas1 endonucleases for integration into
the host genomes via a mechanism resembling that of spacer
integration by CRISPR-Cas systems. Given that Cas1 is
a key enzyme of the CRISPR-Cas immunity and considering
the deep branching of casposon homologs in the Cas1
phylogeny, casposons appear to have played a pivotal role 
in the origin of the adaptive immune system in prokaryotes. 

Results 

Genomic islands containing stand-alone cas1 genes

A recent comparative genomic survey of cas genes revealed
two distinct groups of cas1 genes that are not associated
with CRISPR loci and form two distinct clades
in the Cas1 phylogeny (hereinafter 'Cas1-solo') [33]. The
first Cas1-solo group was exclusively found in members
of the archaeal order Methanomicrobiales and did not
show any evidence of horizontal gene transfer (HGT)
whereas the second group displayed a more patchy 
distribution. Most of the group 2 members were from 
the euryarchaeal class Methanomicrobia; however, sev- 
eral representatives were also detected in members of 
Thaumarchaeota as well as in the hyperthermophilic 
euryarchaeon Aciduliprofundum boonei affiliated with 
the order Thermoplasmatales [33]. We hypothesized 
that Cas1-solo might be ancestral to the Cas1 proteins
found in CRISPR-Cas systems and set out to investi- 
gate their provenance and potential function. 

A previous site-directed mutagenesis study identified
four conserved residues constituting the active site of Cas1
endonucleases (E141, H208, D218 and D221 in Escherichia
coli Cas1); alanine substitutions at any of these positions
abolished the nuclease activity of Cas1 against all substrates
tested [37]. Examination of the multiple alignment
of Cas1-solo protein sequences showed that none of the
group 1 members had the full complement of active site
residues [see Additional file 1: Figure S1a], indicating that
these proteins are unlikely to be active endonucleases (or,
less likely, rely on a different set of catalytic residues) and
evolved under different constraints than the functional
Cas1 proteins. In contrast, the four catalytic residues are
strictly conserved in all Cas1-solo proteins from group 2
[see Additional file 1: Figure S1b]. Therefore, for further
analysis we focused on group 2 of Cas1-solo.
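
The active-site check described above can be illustrated with a minimal sketch. It assumes an alignment in which one row is the E. coli Cas1 reference, maps the reference residue numbers to alignment columns, and asks whether a query row carries the expected residue at each of them. The function names and the toy alignment are hypothetical and are not taken from the supplementary files; only the four residue positions come from the text.

```python
# Illustrative sketch; function names and the toy alignment are assumptions,
# not taken from the paper or its supplementary files.

# Catalytic residues reported for E. coli Cas1 (1-based positions) [37].
ACTIVE_SITE = {141: "E", 208: "H", 218: "D", 221: "D"}


def reference_columns(ref_aligned, positions):
    """Map 1-based positions of the ungapped reference to alignment columns."""
    columns = {}
    residue_number = 0
    for column, char in enumerate(ref_aligned):
        if char != "-":
            residue_number += 1
            if residue_number in positions:
                columns[residue_number] = column
    return columns


def has_full_active_site(query_aligned, ref_aligned, sites=ACTIVE_SITE):
    """True if the query carries the expected residue at every catalytic column."""
    columns = reference_columns(ref_aligned, set(sites))
    if len(columns) != len(sites):
        raise ValueError("reference row is shorter than the requested positions")
    return all(query_aligned[columns[pos]] == aa for pos, aa in sites.items())


# Toy alignment with made-up sequences and positions, for demonstration only.
toy_sites = {2: "E", 3: "H", 4: "D", 5: "D"}
toy_ref = "ME-HD-D"
print(has_full_active_site("MEAHDGD", toy_ref, toy_sites))  # all four present
print(has_full_active_site("MQAHDGN", toy_ref, toy_sites))  # two substituted
```

With a real alignment, the default `ACTIVE_SITE` table would be used together with the aligned E. coli Cas1 row; a query that fails the check would correspond to the group 1 pattern described in the text.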

It has been noted that group 2 cas1-solo genes are often
present in a conserved neighborhood that additionally includes
genes for a PolB-like polymerase, an HNH nuclease
and two helix-turn-helix (HTH) domain-containing proteins
[33]. To explore the phyletic distribution of such
Cas1-solo-containing genomic islands, we searched the
available archaeal and bacterial genomes for co-occurrence
of cas1 and polB genes. This analysis identified multiple
'genomic islands', in addition to those previously reported
[33]. In total, we detected 19 islands matching our criteria
[see Additional file 1: Table S1]. In addition to archaea, such
islands were detected in the genomes of several bacteria,
namely Streptomyces albulus CCRC 11814 (Actinobacteria),
Henriciella marina DSM 19595 (Alphaproteobacteria) and
Nitrosomonas sp. AL212 (Betaproteobacteria) as well as
on a genomic scaffold of an uncultured thermophilic
bacterium Candidatus 'Acetothermum autotrophicum'.
Phylogenetic analysis confirmed that all newly identified
Cas1 proteins belong to the same clade of Cas1-solo
group 2 (Figure 1). Notably, the divergence of this Cas1
group appears to antedate the radiation of the three
major types of CRISPR-Cas systems [35].
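
A search for co-occurrence of cas1 and polB genes can be sketched as follows. The text does not state the exact criteria used, so the 20 kb distance cutoff, the `Gene` record and the function name are assumptions made purely for illustration; the input is taken to be a list of already annotated genes.

```python
# Illustrative sketch; the record layout and the 20 kb cutoff are assumptions.
from collections import namedtuple

Gene = namedtuple("Gene", "contig start end name")


def find_cooccurrence(genes, first="cas1", second="polB", max_distance=20000):
    """Return (contig, gene_a, gene_b) for gene pairs lying within max_distance bp."""
    by_contig = {}
    for gene in genes:
        by_contig.setdefault(gene.contig, []).append(gene)

    hits = []
    for contig, items in by_contig.items():
        firsts = [g for g in items if g.name == first]
        seconds = [g for g in items if g.name == second]
        for a in firsts:
            for b in seconds:
                # Distance between the two genes; negative if they overlap.
                gap = max(a.start, b.start) - min(a.end, b.end)
                if gap <= max_distance:
                    hits.append((contig, a, b))
    return hits


# Made-up annotations for demonstration only.
annotations = [
    Gene("contig1", 1000, 2000, "cas1"),
    Gene("contig1", 5000, 8000, "polB"),
    Gene("contig2", 100, 900, "cas1"),
    Gene("contig2", 90000, 93000, "polB"),
]
for contig, a, b in find_cooccurrence(annotations):
    print(contig, a.name, b.name)
```

Only the pair on `contig1` is reported here, because the two genes on `contig2` are further apart than the assumed cutoff.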

Discovery of casposons 

Gene content analysis showed that Cas1-solo-encoding
genomic islands from Thaumarchaeota contain PolBs that 
belong to the group of protein-primed polymerases. These 



polymerases are encoded by various viruses and eukaryotic 
self-synthesizing transposons of the Polinton/Maverick 
family [26,27] but generally not by cellular organisms. 
Thus, we hypothesized that these islands represent inte- 
grated MGE, analogous to the eukaryotic self-synthesizing 
transposons. DNA transposons typically possess TIR and 
upon integration into the genome often contain a specific 
mark, the target site duplication (TSD) which flanks the 
transposon [13,15]. We investigated the Cas1- and PolB-containing
genomic islands for the presence of these features
and found that in nearly all cases these loci were
flanked by TIRs and direct repeats which correspond to
TSD [see Additional file 1: Table S1]. None of these elements
contained identifiable genes for serine or tyrosine
recombinases nor did they carry conserved transposase
genes (see also below). The only enzyme that is consistently
present in all these elements and, judged by its
experimentally characterized activity, is capable of mediating
the integration of the elements into the host
genome is Cas1. Accordingly, we denote this new group
of transposon-like elements 'casposons'. The conservation
of polB genes places casposons as a new (super)family
into the class of self-synthesizing large DNA
transposons [14].

TIRs, TSDs and integration sites 

The unique casposon TIRs are highly variable in length (25
to 602 nucleotides, median of 56) and could be identified
in all casposons, except for the three closely related elements
(MetBur-C1 to -C3) in the genome of Methanococcoides
burtonii DSM 6242 [see Additional file 1: Figure S2].
Some of the TIRs contain internal palindromic sequences
[see Additional file 1: Figure S2].
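
As a minimal illustration of what a terminal inverted repeat is, the sketch below measures the longest perfect TIR of an element by comparing its left terminus with the reverse complement of its right terminus. This is a simplification for demonstration only: real TIRs are often imperfect, and the function names and the toy sequence are hypothetical rather than the procedure used in this study.

```python
# Illustrative sketch; perfect matching only, names and sequences are made up.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def perfect_tir_length(element):
    """Length of the longest perfect terminal inverted repeat of `element`."""
    rc = reverse_complement(element)
    length = 0
    limit = len(element) // 2  # the two repeat arms cannot overlap
    while length < limit and element[length].upper() == rc[length].upper():
        length += 1
    return length


# Toy element: 7-nt left arm, 2-nt spacer, reverse complement of the left arm.
toy_element = "GATTACA" + "CC" + reverse_complement("GATTACA")
print(perfect_tir_length(toy_element))
```

A tolerant search (for example, local alignment of one terminus against the reverse complement of the other) would be needed to recover TIRs that contain mismatches or internal palindromes.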

The TSDs result from the fill-in repair of staggered nicks 
introduced by transposases at the target site upon insertion 
of MGE [15,41]. The length of the TSD differs depending 
on the transposase involved but in addition varies within 
as well as between transposon families [13,15]. The great 
majority of casposons are flanked by perfect direct repeats 
corresponding to TSD and ranging in length from 1 to 27 
nucleotides (median of 15; Additional file 1: Table S1). In a
substantial fraction of the identified casposons, one or
both TIRs partially overlap with the TSDs [see Additional
file 1: Table S1 and Figure S2], suggesting that, prior to
integration, these casposons contained short terminal 
overhangs that were partially complementary to the 
staggered ends of the nicked target site. By contrast, 
the casposons in which the overlaps between the TSDs 
and TIRs were not present likely had blunt termini 
prior to integration. 
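
The notion of a target site duplication can be made concrete with a short sketch that looks for the longest perfect direct repeat immediately flanking an element. It assumes that the element coordinates exclude the duplication itself, which is a simplification given the TIR/TSD overlaps noted above; the function name, the 30-nt upper bound and the toy sequence are assumptions for illustration.

```python
# Illustrative sketch; coordinates, the 30-nt bound and sequences are assumptions.
def find_tsd(genome, start, end, max_len=30):
    """Longest perfect direct repeat flanking genome[start:end] (0-based, end-exclusive).

    Returns the repeat sequence, or an empty string if the flanks do not match.
    """
    best = ""
    for k in range(1, max_len + 1):
        if start - k < 0 or end + k > len(genome):
            break
        left = genome[start - k:start]   # k nucleotides just upstream
        right = genome[end:end + k]      # k nucleotides just downstream
        if left == right:
            best = left
    return best


# Toy locus: flank + TSD + element + TSD + flank (all sequences made up).
toy_genome = "TTGAC" + "ACGTA" + "GGGGGGGG" + "ACGTA" + "CCATT"
print(find_tsd(toy_genome, 10, 18))
```

Because very short matches arise by chance, a 1-nt or 2-nt result from such a search would need additional support before being interpreted as a genuine TSD.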

Most transposons do not display strong target site pref- 
erence but some are known to integrate site-specifically 
[42,43]. Ten casposons were found to be inserted into 
intergenic loci whereas for eight others, the target sites 
Figure 1 Phylogeny of Cas1 proteins. Cas1 proteins encoded by casposons are represented in the framework of the Cas1 sequences
representing the major types of CRISPR-Cas system. All clades including Cas1 from different CRISPR-Cas systems as well as Cas1-solo group 1 were 
collapsed for clarity (altogether, 52 non-casposon Cas1 sequences were analyzed, a representative subset of a larger collection of Cas1 sequences 
analyzed previously [33]; the Cas1 sequence alignment used to generate the tree is provided in Additional file 2, whereas the tree in which all 
branches are expanded is shown in Additional file 1: Figure S5). Numbers at the branch points represent RELL (resampling of estimated log-likelihoods)-like 
local support values calculated by FastTree. 



overlapped with coding sequences. The target sites of the 
three complete thaumarchaeal casposons were located 
within the 3'-distal region of the gene encoding the translation
elongation factor aEF-2, whereas in five euryarchaeal
casposons the target site overlapped four to seven 3'-distal
nucleotides of different tRNA genes [see Additional
file 1: Table S1]. Notably, eukaryotic transposons of the
recently described DADA superfamily integrate site- 
specifically into snRNA and tRNA genes [42]. However, 
unlike DADA transposons, which integrate close to the 
anticodon loop of tRNA genes, casposons do not alter the 
sequence of their target genes (either tRNA or aEF-2) and 
are located proximal to these genes. This pattern of integration
is reminiscent of the bacterial Tn7 transposon
which recognizes the 3'-distal region of the highly conserved
glucosamine-6-phosphate synthase gene (glmS) but inserts
downstream of the glmS coding region, preserving the 
integrity of the latter [43]. Such a strategy ensures that 



the integration of Tn7 and casposons does not disrupt 
genes essential for host viability, thereby ensuring suc- 
cessful propagation of both the host and the respective 
MGE. 

Casposon mobility 

In most cases, when complete genome sequences are 
available, casposons are present in one copy per genome, 
consistent with their site-specific integration. However, 
M. burtonii DSM 6242 encompasses three closely related 
casposons (MetBur-C1 to -C3) which are adjacent on
the genome [see Additional file 1: Figure S3a], suggest- 
ing recent activity of casposons in this archaeon. Not- 
ably, MetBur-C3 appears to be inactivated because two 
of the genes in this element contain amber mutations. 
Again, a parallel with the Tn7 transposon can be drawn. 
Tn7 is usually present in a single copy per genome. 
However, with lower efficiency, additional copies can be 
integrated into the same target site, forming islands of
tandem transposons, some of which become inactivated 
with time [44]. M. psychrophilus R15 is another organism 
in which remnants of a second, adjacent casposon are 
present [see Additional file 1: Figure S3b]. The TIRs of the 
latter element could not be identified, suggesting that it 
might be in the process of deterioration. The patchy taxo- 
nomic distribution of casposons in archaeal and bacterial 
genomes as well as their amplification in certain organ- 
isms suggests that they are active mobile elements. How- 
ever, the possibility that the amplification of casposons in 
M. burtonii DSM 6242 and M. psychrophilus R15 genomes 
is a result of segmental duplication cannot be ruled out. 
Experimental study of casposon integration and excision, 
as well as analysis of many more complete archaeal ge- 
nomes, is necessary to provide definitive answers regard- 
ing the mobility of these MGE. 

Classification of casposons 

Based on the gene content, taxonomic distribution and 
specific relationships between the Cas1 proteins, casposons
can be classified into three families (Figure 2). All four
family 1 casposons are found in the genomes of different 
ammonia-oxidizing species of the thaumarchaeal genus 
Nitrosopumilus isolated from marine sediments [45,46]. 
The NitSJ-C1 casposon from Nitrosopumilus sp. SJ is
nearly identical to NitAR1-C1 from Candidatus Nitrosopumilus
koreensis AR1, except for two single-nucleotide
deletions in the latter. Otherwise, similarity between family 1
casposons is limited to five universally present genes,
including cas1, polB and three small genes of unknown
function (Figure 2). Cas1 proteins of family 1 casposons
form a separate clade in the phylogenetic tree (Figure 1) 
and are more compact compared to the homologs from 
other casposons, with no additional protein domains (see 
below). 

The defining feature of family 1 casposons is that they 
carry a gene for a protein-primed PolB. To investigate the 
relationship between family 1 casposons and other protein- 
primed PolB-encoding MGE, in particular the eukaryotic 
Polinton/Maverick transposons, we performed a phylogen- 
etic analysis of the corresponding PolBs from a wide range 
of viruses, plasmids and transposons (Figure 3). In the 
resulting tree, the casposon PolBs form a sister group to 
PolBs from the halophilic archaeal viruses His1 and His2
[47], and this archaeal clade is embedded deep within the
clade of prokaryotic MGE that in addition includes several
viral families. Although His1 and His2 were initially considered
to be related (mainly due to the presence of homologous polB genes),
recent analysis has shown that they belong to different
virus families [48,49], suggesting that the polB genes
have been acquired independently by the ancestors of the
two viruses. Although currently available data do not allow
one to unequivocally infer the directionality of the gene 



transfer, it seems likely that family 1 casposons were the 
donors of the PolB gene for both His1 and His2 viruses.
Clearly, more viral and casposon sequences are needed to 
ascertain the directionality of polB gene transfer between 
these different types of elements. More importantly, phylo- 
genetic analysis of the PolB proteins (Figure 3) shows that 
despite sharing a number of features, including size, pres- 
ence of TIRs and genes for protein-primed PolB, casposons 
are not related to the eukaryotic self-synthesizing transpo- 
sons by descent but rather are analogous to them. Indeed, 
the only feature shared between casposons and Polintons/ 
Mavericks is that both types of elements are predicted to 
replicate using self-encoded DNA polymerases which be- 
long to the same family but do not form a clade (Figure 3). 
Nevertheless, this shared property defines the class of self- 
synthesizing large DNA transposons [13-15]. 

Family 2 casposons are present in diverse members of 
the archaeal phylum Euryarchaeota [see Additional file 1: 
Table S1], including the unclassified human gut-associated
methanogen Methanomassiliicoccus luminyensis B10 [50]
as well as the hyperthermoacidophile A. boonei T469 [51]. 
PolBs encoded by casposons of family 2 are related to the 
PolB3 family of typical archaeal RNA-primed DNA poly- 
merases [see Additional file 1: Figure S4] [52,53]. The cas- 
poson polymerases form a sister group to a small clade of 
PolBs from Thermoproteales (phylum Crenarchaeota). 
Notably, in the latter clade the polB gene of Ignisphaera 
aggregans DSM 17230 is located within an integrated mo- 
bile element which is unrelated to casposons and carries a 
gene for a tyrosine integrase (KSM, MK, EVK, unpub- 
lished work). This observation suggests that, as with the 
family 1 casposons, there could have been exchange of 
PolB genes between family 2 casposons and other types of 
MGE. Cas1 proteins of all family 2 casposons contain a
conserved C-terminal fusion of an HTH domain which is
not found in any other Cas1 proteins. Notably, a similar
HTH domain is also found in the C-termini of PolBs of
family 2 casposons. Although family 2 casposons vary considerably
in size (6 to 16 kb) and gene content, most of
them share a core of five genes which encode Cas1, PolB,
an HNH endonuclease and two distinct HTH proteins 
(Figure 2). One of the conserved HTH proteins contains a 
C-terminal HEAT repeat domain (PF02985); HEAT re- 
peats form rod-like helical structures that mediate vari- 
ous protein-protein interactions [54]. The conserved 
HTH proteins and the HNH endonuclease might be in- 
volved in the target site recognition and subsequent 
casposon integration, in concert with Cas1. This mechanism
of integration would resemble integration of the
site-specific transposon Tn7 mentioned above. The het- 
erotrimeric Tn7 transposase TnsABC binds the termini 
of the transposon whereas targeting to the specific re- 
gion of the host glmS gene is mediated by the sequence- 
specific DNA-binding protein TnsD [43]. 
Figure 2 Casposon genome maps. The three families of casposons are indicated. The precise nucleotide coordinates of depicted casposons
can be found in Additional file 1: Table S1. Predicted protein-coding genes are indicated with arrows, indicating the direction of transcription.
The color key for the designation of the common genes is shown in the top right area of the figure. 'HTH (other)' denotes proteins that are not 
orthologous but nevertheless contain HTH domains. Terminal inverted repeats (TIR) are shown with black rectangles and their sequences are 
shown in Additional file 1: Figure S2. The grey boxes outlined with a broken line in MetPsy-C1 depict duplicated regions. The striped green arrows 
represent genes encoding divergent HNH proteins. Abbreviations: ZBD, zinc-binding domain-containing protein; HNH, HNH family endonuclease; 
HTH, helix-turn-helix proteins; MTase, methyltransferase; RHH, ribbon-helix-helix protein; REase, restriction endonuclease; AP, apurinic/apyrimidinic; 
UDG, uracil-DNA glycosylase; SFI, superfamily I; IR, inverted repeat. 
Figure 3 Phylogeny of protein-primed B family DNA polymerases. Clades that are only distantly related to the casposon-encoded proteins
were collapsed. The tree is rooted with phi29-like bacteriophages of the Podoviridae family. Numbers at the branch points represent RELL 
(resampling of estimated log-likelihoods)-like local support values calculated by FastTree. 



Family 3 casposons are present in the genomes of different
bacteria, including an uncultivated thermophilic
bacterium Candidatus 'Acetothermum autotrophicum'
[see Additional file 1: Table S1]. In the Cas1 phylogeny,
family 3 casposons form a distinct clade that is a sister
group to the rest of the casposons (Figure 1). By contrast,
in the PolB tree, the family 3 clade emerges from
within the family 2 casposons [see Additional file 1:
Figure S4], compatible with the possibility that casposons
emerged in archaea and were horizontally transferred to
bacteria subsequent to the divergence of the casposon
families 1 and 2. The Cas1 protein of NitAL212-C1 contains
a zinc-binding domain (ZBD) and an HTH domain
fused to the N- and C-termini of the Cas1 domain, respectively,
whereas in AceAut-C1 both ZBD and HTH
are fused to the C-terminus of the Cas1 domain
(Figure 2). The Cas1 proteins of StrAlb-C1 and
HenMar-C1 do not contain any additional domains,
similar to the Cas1 of family 1 casposons. Three of the
four family 3 casposons contain genes for the HNH
endonuclease and a conserved HTH protein shared
with the family 2 casposons (Figure 2). StrAlb-C1 encodes
only a homolog of the HNH endonuclease although
a gene for an unrelated HTH protein, which
might be functionally equivalent, was also identified
(Figure 2). Thus, both the PolB phylogeny and the
comparison of the sets of predicted genes point to an
affinity between the casposon families 2 and 3.

Casposon gene repertoire 

Casposons vary greatly in terms of gene content, both 
within and between the three families described above, and 
carry many lineage-specific genes. Virtually all of the genes 



for which functions could be inferred are predicted to be 
involved in various DNA manipulations. Three consistent 
themes could be discerned among the products of the cas- 
poson genes (Figure 2 and Additional file 1: Table S2). 

The first group of proteins includes predicted nucleases
potentially involved in casposon integration/excision. In
addition to the Cas1 endonuclease, the hallmark
casposon protein, this group includes the HNH
endonuclease that is present in all family 2 and family 3
casposons and is likely to cooperate with Cas1 in the
integration and excision processes. MetLum-C1 and
MetPsy-C1 contain genes for apurinic/apyrimidinic (AP)
endonucleases, whereas MetPsy-C1 also encodes a
GIY-YIG nuclease and a uracil-DNA glycosylase (UDG)
that might be involved in the repair of the termini
following casposon integration. MetTin-C1 encodes an
exonuclease which could contribute to the processing of
the casposon termini.

The second group of casposon proteins is implicated in
DNA replication. Besides the two types of PolB genes,
many casposons carry genes for various helicases which
might assist during the replication of the casposon DNA.
Notably, AceAut-C1 encodes not only a HerA-like
helicase but also a putative DnaC-like helicase loader as
well as an additional protein containing the AAA+
ATPase domain, which is found in many helicases,
including MCM [55]. HenMar-C1 encodes a so far unique
fusion protein containing an N-terminal nuclease domain
related to the Cas4-like proteins of the CRISPR-Cas
systems and a C-terminal superfamily I helicase domain.

The third category consists of various small
DNA-binding proteins containing HTH, ZBD or
ribbon-helix-helix (RHH) domains. Various
combinations of these proteins are encoded in most
casposons (Figure 2), and it cannot be ruled out that
some of the uncharacterized small proteins contain
highly derived versions of DNA-binding domains.
These proteins could contribute to both integration/
excision and replication of the casposons, and in addition,
some of them might regulate expression of casposon and/
or host genes.

In addition, casposons encode various enzymes some of 
which are associated with genome, RNA or chromatin 
modification whereas others are implicated in defense 
functions and metabolic processes [see Additional file 1: 
Table S2]. Three casposons encode highly divergent Sir2- 
like proteins; in Sulfolobus, Sir2 has been shown to deace- 
tylate the major archaeal chromatin protein, Alba, thereby 
modulating the chromatin structure [56]. Although the 
casposon-encoded Sir2-like proteins might also function 
as deacetylases, alternative enzymatic activities of these 
derived Sir2 homologs cannot be ruled out. By contrast, 
AmiAut-C1 encodes a unique fusion protein containing
an N-terminal GCN5 acetyltransferase domain and a
C-terminal queuine tRNA-ribosyltransferase domain. In
addition, AciBoo-C1 and HenMar-C1 encode N6 and C5
DNA methyltransferases, respectively, whereas MetLum-C1
encodes a putative restriction endonuclease. Other
notable proteins encoded by casposons include a KAP
family P-loop ATPase [57] (MetMah-C1), a Ser/Thr kinase
(HenMar-C1) and an Sm-like RNA-binding protein
(AmiAut-C1).

Two casposons, MetArv-C1 and NitAL212-C1, carry insertion
sequence (IS) elements of the families ISNCY and
IS1595, respectively [58]. In both cases, the IS transposase 
genes are flanked by typical short inverted repeats and 
TSDs, indicating that the IS elements parasitize casposons 
rather than participate in their propagation. The sporadic 
conservation of functionally diverse genes in distinct cas- 
posons, even those that belong to the same family, indi- 
cates that, similar to viruses, casposons can horizontally 
acquire genes from various sources. 

Discussion 

Transposons as a type of MGE are polyphyletic with re- 
spect to the enzymes mediating their transposition 
[10,13,15]. Here, we describe a new type of mobile elements,
the casposons, which appear to rely on Cas1-like
endonucleases for genome integration. Structures of several
Cas1 proteins have been solved [37,39,40], showing
that Cas1 proteins adopt a novel structural fold, unrelated
to the folds of any of the transposases described to
date. Nevertheless, casposons share a number of features 
with known DNA transposons. On the one hand, they 
appear to behave as site-specific transposons, akin to the 



bacterial transposon Tn7 [43,44]. On the other hand, the 
molecular structure of casposons is highly reminiscent 
of the eukaryotic self-synthesizing DNA transposons of 
the Polinton/Maverick superfamily [26,27]. Similar to 
the Polintons/Mavericks, casposons possess TIRs and
encode their own DNA polymerases, which in family 1
casposons belong to the same protein-primed class
as the polymerases of Polintons/Mavericks.

Based on the model previously proposed for Polinton/ 
Maverick transposons [26], we hypothesize that casposon 
DNA replication proceeds via a single-stranded (ss) DNA 
intermediate and primarily depends on the casposon-encoded
PolB (Figure 4a). First, during cellular DNA replication,
the casposon sequence is likely to loop out on the
lagging strand due to the formation of a double-stranded
stem involving the TIR sequences. The next stage involves
Cas1-catalyzed excision of the casposon. Importantly,
Cas1 from E. coli has been shown to act efficiently on
different branched DNA substrates [37], including the
splayed-arm duplex DNA, which is similar to the looped-out
casposon intermediate depicted in Figure 4a. The TIRs
of the excised ssDNA casposon would form the panhandle 
structures which serve as the replication origin in various 
viruses and plasmids encoding protein-primed PolBs 
[59,60]. In the latter systems, replication is primed by an
MGE-encoded protein that covalently binds to the 5'-terminus
of the nascent strand. This is likely to also be the
case for the family 1 casposons, which encode protein-primed
PolBs (Figure 3). Terminal proteins are known to
be highly divergent but are typically encoded immediately 
upstream of the polB genes or as N-terminal fusions of 
the PolB proteins [59-61]. All family 1 casposons share an 
appropriately positioned conserved gene which could 
encode a terminal protein (Figure 2). Family 2 and 3 
casposons encode PolBs that are more closely related to 
the typical archaeal RNA-primed PolBs, suggesting that 
their replication is primed by the host primase. Eventually, 
a new double-stranded casposon copy is synthesized 
through the concerted activities of PolB and accessory 
host- and/or casposon-encoded replication proteins, 
such as helicases (Figure 4a). 
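
The role of the TIRs in this model can be illustrated with a minimal Python sketch that measures how far the 5' end of a single-stranded element can base-pair with its own 3' end, that is, the length of the stem of the proposed panhandle. The function names and the toy sequence are hypothetical and are not taken from the casposon data. Only perfect Watson-Crick pairing is considered, whereas real TIRs are often imperfect.

```python
# Illustrative sketch only: names and the toy sequence are assumptions,
# not part of the analysis described in the paper.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_stem_length(element: str) -> int:
    """Length of the perfect terminal inverted repeat of `element`.

    The 5' terminus is compared, base by base, with the reverse
    complement of the 3' terminus. The count stops at the first
    mismatch or at the midpoint of the element.
    """
    seq = element.upper()
    rc = reverse_complement(seq)
    n = 0
    while n < len(seq) // 2 and seq[n] == rc[n]:
        n += 1
    return n


# Toy element: 7-nt TIRs flanking an unrelated 10-nt core (hypothetical).
toy = "GATTACA" + "TTTTTCCCCC" + "TGTAATC"
print(tir_stem_length(toy))  # 7
```

In such a sketch, a non-zero result indicates that the two termini of the excised single strand could anneal into a stem, leaving the internal region as the loop of the panhandle.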

Given the presence of a PolB gene in all casposons, we 
propose to classify these elements as the second superfam- 
ily within the class of self-synthesizing DNA transposons, 
in addition to the Polintons/Mavericks [14]. Most
Polintons show a distinct virus-like character, with two
conserved genes encoding major and minor capsid pro- 
teins, suggesting that these elements form virus particles 
under some circumstances and prompting their proposed 
re-classification as polintoviruses [24]. The casposons, 
however, do not encode any detectable homologs of capsid 
proteins and accordingly are likely to adhere to the transposon
lifestyle. We define casposons as self-synthesizing
MGE which rely on Cas1 for integration. Under this
definition, casposons may encode distinct DNA polymerases
(as is indeed the case for family 1 compared to
families 2 and 3) but should possess the ability to
self-synthesize.

Figure 4 Proposed mechanisms of casposon replication and integration. (a) Mechanism of casposon DNA excision and replication.
(b) Mechanism of spacer acquisition by the CRISPR-Cas system. Adapted from [34]. (c) Proposed mechanism of Cas1-mediated casposon integration.
Subterminal regions within TIRs outlined with the broken line indicate that processing occurs only in a fraction of casposons.
See text for details. Abbreviations: TIR, terminal inverted repeats; TSD, target site duplication.

The Cas1 endonuclease is the key player in the adaptation
step of CRISPR-Cas immunity. Consistent with its
importance, Cas1 is the most stable and conserved component
of functional CRISPR-Cas systems and is considered
a signature gene for these defense systems [32,33,35].
A model of spacer acquisition, which integrates the available
experimental data, has been proposed [30,34-36].
This model helps to predict the specific path of the Cas1-mediated
integration of casposons, which is also consistent
with the detailed analysis of the terminal sequences of
integrated casposons [see Additional file 1: Figure S2]. 
Figures 4b and 4c depict the parallel flows of events 
underlying the insertion of new CRISPR spacers and cas- 
posons, respectively. In both cases, staggered nicks are in- 
troduced into the target sequence which in the case of 
CRISPR-Cas corresponds to the first repeat proximal to 
the leader sequence. In the case of some casposons, the 
TIR-containing termini are processed to produce short 
overhangs complementary to the tips of the nicked target 
site (see above). In the next step, the ends of the
protospacer/casposon are joined to those of the nicked
target site. The observation that in CRISPR-Cas systems
Cas1 is the only protein whose enzymatic activity is essential
for integration of new spacers [62], a process that involves
cutting and rejoining of the cellular DNA within
the CRISPR repeat arrays, suggests that the casposon Cas1
also possesses both DNA cutting and joining activities.
However, the latter activity of Cas1 remains to be demonstrated
experimentally. Finally, the target site is repaired by
fill-in synthesis, completing the casposon/spacer insertion and
resulting in the TSD (for casposons) or repeat duplication
(in CRISPR-Cas) (Figure 4b, c).
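
The way in which staggered nicks followed by fill-in repair give rise to a duplication can be sketched in a few lines of Python. The sketch below is a simplified, single-string (top-strand) model written under stated assumptions: the function names, the toy sequences and the 5-nt stagger are hypothetical and are not measurements from the casposons analyzed here.

```python
# Illustrative sketch only: a top-strand model of insertion at a staggered
# cut. Names, sequences and the stagger length are assumptions.

def target_site_duplication(target: str, nick_pos: int, stagger: int) -> str:
    """Return the target segment lying between the two staggered nicks."""
    if nick_pos < 0 or stagger < 0 or nick_pos + stagger > len(target):
        raise ValueError("nicks must fall within the target sequence")
    return target[nick_pos:nick_pos + stagger]


def integrate(target: str, element: str, nick_pos: int, stagger: int) -> str:
    """Insert `element` at a staggered cut and fill in the gaps.

    The segment between the nicks stays attached to the left flank on one
    strand and to the right flank on the other. After fill-in repair it is
    therefore present on both sides of the insert.
    """
    tsd = target_site_duplication(target, nick_pos, stagger)
    left_flank = target[:nick_pos] + tsd     # flank plus first copy
    right_flank = target[nick_pos:]          # second copy plus flank
    return left_flank + element + right_flank


target = "AAACGTACGGG"   # hypothetical target site
element = "NNNNNNNN"     # placeholder for the inserted element or spacer
product = integrate(target, element, nick_pos=3, stagger=5)
print(product)
```

The same bookkeeping applies to both panels of the model: for a casposon the duplicated segment corresponds to the TSD, whereas for a CRISPR spacer it corresponds to the leader-proximal repeat.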

The discovery of casposons has important evolutionary 
implications. The deep branching of casposon Cas1 homologs
within the global Cas1 phylogeny (Figure 1) is
compatible with the possibility that the Cas1 family of endonucleases
emerged in the context of mobile elements
and only later was adapted for cellular defense. Consequently,
we propose that casposons played a pivotal role
in the origin of prokaryotic CRISPR-Cas immunity. The
origin of Cas1 appears not to be the only contribution of
transposable elements to the evolution of CRISPR-Cas. Indeed,
recent comparative genomic analysis of the type II
CRISPR-Cas systems has shown that Cas9, the key protein
of these systems involved in the RNA processing and 
interference stages, most likely, also evolved from a dis- 
tinct class of transposon proteins [63]. 

It has been previously hypothesized that the CRISPR- 
Cas system originated in archaea [32], and the present 
observations on the likely archaeal origin of casposons 
appear compatible with this hypothesis. However, it can- 
not be ruled out that similar to some other MGE, cas- 
posons are even more ancient and antedate advanced 
cellular life forms [23]. 

Strikingly, transposons can also be placed at the root 
of adaptive immunity in eukaryotes. The RAG1 protein, 
which plays a central role during the V(D)J recombination, 
was derived from the DDE transposase of Transib trans- 
posons [64]. The parallel contribution of transposons to 
the origin of adaptive immunity in prokaryotes and eu- 
karyotes emphasizes that MGE are the molecular archi- 
tects behind some of the major evolutionary innovations 
of their hosts, in particular, the cellular defense systems 
[23,65]. More specifically, given the mechanistic similarity 
between MGE transposition and integration, on the one 
hand, and insertion of spacers by the CRISPR-Cas system 
and immunoglobulin gene rearrangement, on the other 
hand, integrases and transposases appear to be ready- 
made tools that can be recruited and utilized by adaptive 
immunity systems. 

Conclusions 

The diversity of MGE is astounding and is far from be- 
ing fully explored. This state of affairs is well illustrated 
by the discovery of casposons described here. Casposons 
constitute the second superfamily of self-synthesizing 
transposon-like MGE, alongside the eukaryotic Polinton/
Maverick transposons, and are the first representatives 
of this class of elements in prokaryotes. Different MGE 
have evolved a number of unrelated molecular mecha- 
nisms to perform similar tasks that ensure their propa- 
gation within the host cells. The casposons, so far, are 
unique as the only group of MGE that apparently rely 
on Cas1 endonucleases, a key component of the prokaryotic
CRISPR-Cas defense system, for insertion into
and excision from the host genome. The perennial arms 
race between cellular organisms and various MGE 
seems to be one of the major driving forces underlying 
the evolution of both interacting parties and it is be- 
coming increasingly clear that cells and MGE exchange 
molecular inventions that emerge in the process of this 
evolutionary struggle. The adaptive immunity of both 
prokaryotes and eukaryotes apparently evolved via recruit- 
ment of recombinases from distinct MGE, the casposons 
and the Transib family transposons, respectively. Although 
this route of evolution seems paradoxical given that MGE 
are the primary targets of the immunity systems, it is becoming
clear that throughout the course of evolution,
MGE have served as a rich source of naturally evolved
tools for cellular genome engineering that had a major 
impact on the diversification of cellular organisms. 

Methods 

Casposon protein sequences were analyzed using PSI-BLAST
[66], CD-Search [67], and HHpred [68]. Inverted
and direct repeats flanking the casposons were analyzed
using Unipro UGENE [69]. The palindromic repeats
within the casposon TIR sequences were identified using
Mfold [70]. Insertion sequences were analyzed using ISfinder
[71]. Multiple sequence alignments were built using
Promals3D [72] and Muscle [73]. The Polinton/Maverick
PolB sequences were recovered from the Repbase Update 
database [74]. For phylogenetic analysis, gapped columns
(more than 30% gaps) and columns with low information
content were removed from the alignment [75].
Phylogenetic analysis was carried out by using FastTree 
[76], with the Jones-Taylor-Thornton model of amino
acid evolution and γ-CAT estimation of evolutionary rates
across sites. The trees were visualized using MEGA6 [77]. 
For the Cas1 phylogeny, Cas1 protein sequences representing
all major types and subtypes of the CRISPR-Cas
systems were obtained from [33] and supplemented with
the casposon-encoded Cas1 protein sequences. The Cas1
sequence alignment used to generate the tree is provided 
in Additional file 2. 
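
The gap-based column filter mentioned above can be sketched as follows. This is a minimal Python illustration, not the procedure of [75]: the function name, the gap characters and the toy alignment are assumptions, and the additional removal of columns with low information content is not implemented.

```python
# Illustrative sketch only: removes alignment columns whose gap fraction
# exceeds a threshold. The information-content filter is not reproduced.
from typing import List


def drop_gappy_columns(alignment: List[str],
                       max_gap_fraction: float = 0.30,
                       gap_chars: str = "-.") -> List[str]:
    """Return the alignment without columns that have too many gaps.

    A column is kept when its fraction of gap characters is at most
    `max_gap_fraction`; columns with more than 30% gaps are removed
    under the default setting.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_rows = len(alignment)
    keep = [
        col for col in range(width)
        if sum(row[col] in gap_chars for row in alignment) / n_rows
        <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Toy alignment (hypothetical): the third column is 75% gaps and is removed.
toy_alignment = ["MK-LV", "MKALV", "M--LV", "MR-L-"]
for row in drop_gappy_columns(toy_alignment):
    print(row)
```

Columns with exactly 25% gaps are retained in the toy example because they do not exceed the 30% threshold.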

Additional files 



Additional file 1: The file contains Figures S1 to S5 and Tables S1
and S2. Figure S1. Multiple sequence alignments of the two groups of
Cas1-solo proteins. Figure S2. Analysis of the casposon terminal inverted
repeats and target site duplications. Figure S3. Genomic loci showing the
amplification of casposons. Figure S4. Phylogeny of
RNA-primed type B DNA polymerases. Figure S5. Phylogeny of Cas1
proteins. Table S1. Major characteristics of bacterial and archaeal
casposons. Table S2. Annotation of the casposons.

Additional file 2: The file contains the multiple sequence alignment
of Cas1 proteins used to generate the phylogenetic trees shown in
Figure 1 and Figure S5.